Incorporating Job Migration and Network RAM to Share Cluster Memory Resources
Abstract
Job migration and network RAM are two major approaches to using global memory resources effectively in a workstation cluster. Both aim to reduce page faults on each local workstation and to improve the overall performance of cluster computing. Using either remote execution or preemptive migration, a load sharing system can move a job from a workstation without sufficient memory space to a lightly loaded workstation with enough idle memory for the migrated job. In a network RAM system, a job that cannot find sufficient local memory for its working sets uses idle memory on other workstations in the cluster through remote paging. Using trace-driven simulations, we compare the performance and trade-offs of the two approaches and their impact on job execution time and cluster scalability. Our study indicates that job-migration-based load sharing schemes balance job execution across a cluster well, while network RAM can satisfy data-intensive jobs that may not be migratable, because it shares all the idle memory resources in the cluster. We also show that a network RAM cluster of workstations is scalable only if the network is sufficiently fast. Finally, we propose an improved load sharing scheme that combines job migration with network RAM for cluster computing. This scheme uses remote execution to initially allocate a job to the most lightly loaded workstation and, if necessary, network RAM to provide a larger memory space for the job than would otherwise be available. The improved scheme has the merits of both job migration and network RAM, and our experiments show its effectiveness and scalability for cluster computing.
This work is supported in part by the National Science Foundation under grants CCR-9400719, CCR-9812187, and EIA-9977030, by the Air Force Office of Scientific Research under grant AFOSR-95-1-0215, and by Sun Microsystems under grant EDUE-NAFO-980405.
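The combined scheme described in the abstract (remote execution to the most lightly loaded workstation, then network RAM for any memory shortfall) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Workstation` fields, the use of running-job count as the load metric, the tie-breaking rule, and the largest-donor-first borrowing order are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Workstation:
    # All fields are hypothetical; the abstract does not define a load metric.
    name: str
    running_jobs: int      # assumed proxy for CPU load
    idle_memory_mb: int    # memory currently unused on this node


def place_job(job_memory_mb, cluster):
    """Return (host, local_mb, {donor: remote_mb}), or None if the
    cluster as a whole lacks enough idle memory for the job."""
    # Remote-execution step: choose the most lightly loaded workstation
    # (assumed tie-break: prefer the node with more idle memory).
    host = min(cluster, key=lambda w: (w.running_jobs, -w.idle_memory_mb))
    local_mb = min(job_memory_mb, host.idle_memory_mb)
    shortfall = job_memory_mb - local_mb

    # Network-RAM step: cover the shortfall with idle memory on other
    # nodes, visiting the largest donors first (an assumed ordering).
    remote = {}
    donors = sorted(
        (w for w in cluster if w is not host),
        key=lambda w: w.idle_memory_mb,
        reverse=True,
    )
    for donor in donors:
        if shortfall == 0:
            break
        take = min(shortfall, donor.idle_memory_mb)
        if take > 0:
            remote[donor.name] = take
            shortfall -= take

    if shortfall > 0:
        return None
    return host.name, local_mb, remote


cluster = [
    Workstation("A", running_jobs=3, idle_memory_mb=512),
    Workstation("B", running_jobs=1, idle_memory_mb=128),
    Workstation("C", running_jobs=2, idle_memory_mb=256),
]
print(place_job(300, cluster))
```

In this example the job lands on the least loaded node, B, uses B's 128 MB locally, and pages the remaining 172 MB to A. A real system would also have to weigh network speed, since the abstract notes that network RAM scales only when the network is sufficiently fast.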
Similar resources
7. Conclusion
A high-throughput computing policy attempts to efficiently utilize all of the resources in a cluster to generate faster overall execution. It is acceptable for jobs running on a workstation serving remote pages to execute more slowly, if other jobs benefit by achieving faster execution times through the use of network RAM, as long as throughput is maximized. Further work must be done to fully u...
Adaptive and Virtual Reconfigurations for Effective Dynamic Job Scheduling in Cluster Systems
In a cluster system with dynamic load sharing support, a job submission or migration to a workstation is determined by the availability of CPU and memory resources of the workstation at the time [3]. In such a system, a small number of running jobs with unexpectedly large memory allocation requirements may significantly increase the queuing delay times of the rest of jobs with normal memory req...
Network-aware selective job checkpoint and migration to enhance co-allocation in multi-cluster systems
Multi-site parallel job schedulers can improve average job turn-around time by making use of fragmented node resources available throughout the grid. By mapping jobs across potentially many clusters, jobs that would otherwise wait in the queue for local resources can begin execution much earlier; thereby improving system utilization and reducing average queue waiting time. Recent research in th...
Boosting Performance for I/O-Intensive Workload by Preemptive Job Migrations in a Cluster System
Load balancing in a cluster system has been investigated extensively, mainly focusing on the effective usage of global CPU and memory resources. However, if a significant portion of applications running in the system is I/O-intensive, traditional load balancing policies that focus on CPU and memory usage may cause the system performance to decrease substantially. To solve this problem, a new I/...
Reliability for Network Swapping Systems That Support Migration of Remotely Swapped Pages
Network swapping systems allow individual cluster nodes with over-committed memory to use the idle memory of remote nodes as their backing store, and to swap their pages over the network. As the number of nodes in a cluster increases, it becomes more likely that a node will fail or become unreachable, making it important that such a system provide reliability support. Without reliability, a sin...